Visualizing Categorical Distributions

Reading in the data

import altair as alt
import pandas as pd

movies_extended = pd.read_csv('data/movies-extended.csv')
movies_extended
Title US Gross Worldwide Gross US DVD Sales Production Budget Release Date MPAA Rating Running Time min Distributor Source Major Genre Creative Type Director Rotten Tomatoes Rating IMDB Rating IMDB Votes
0 Boynton Beach Club 3127472.0 3127472.0 NaN 2900000.0 Mar 24 2006 R 104.0 Wingate Distribution Original Screenplay Romantic Comedy Contemporary Fiction NaN NaN NaN NaN
1 Broken Arrow 70645997.0 148345997.0 NaN 65000000.0 Feb 09 1996 R 108.0 20th Century Fox Original Screenplay Action Contemporary Fiction John Woo 55.0 5.8 33584.0
2 Brazil 9929135.0 9929135.0 NaN 15000000.0 Dec 18 1985 R 136.0 Universal Original Screenplay Black Comedy Fantasy Terry Gilliam 98.0 8.0 76635.0
3 The Cable Guy 60240295.0 102825796.0 NaN 47000000.0 Jun 14 1996 PG-13 95.0 Sony Pictures Original Screenplay Comedy Contemporary Fiction Ben Stiller 52.0 5.8 51109.0
4 Chain Reaction 21226204.0 60209334.0 NaN 55000000.0 Aug 02 1996 PG-13 106.0 20th Century Fox Original Screenplay Action Contemporary Fiction Andrew Davis 13.0 5.2 15817.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1185 Zombieland 75590286.0 98690286.0 28281155.0 23600000.0 Oct 02 2009 R 87.0 Sony Pictures Original Screenplay Comedy Fantasy Ruben Fleischer 89.0 7.8 81629.0
1186 Zack and Miri Make a Porno 31452765.0 36851125.0 21240321.0 24000000.0 Oct 31 2008 R 101.0 Weinstein Co. Original Screenplay Comedy Contemporary Fiction Kevin Smith 65.0 7.0 55687.0
1187 Zodiac 33080084.0 83080084.0 20983030.0 85000000.0 Mar 02 2007 R 157.0 Paramount Pictures Based on Book/Short Story Thriller/Suspense Dramatization David Fincher 89.0 NaN NaN
1188 The Legend of Zorro 45575336.0 141475336.0 NaN 80000000.0 Oct 28 2005 PG 129.0 Sony Pictures Remake Adventure Historical Fiction Martin Campbell 26.0 5.7 21161.0
1189 The Mask of Zorro 93828745.0 233700000.0 NaN 65000000.0 Jul 17 1998 PG-13 136.0 Sony Pictures Remake Adventure Historical Fiction Martin Campbell 82.0 6.7 4789.0

1190 rows × 16 columns

Bar charts are effective for visualizing categorical “distributions” of a single column

alt.Chart(movies_extended).mark_bar().encode(
    alt.X('count()'),
    alt.Y('Major Genre', sort='x'))
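To see the numbers behind such a bar chart, the counts can be computed directly in pandas. This is a minimal sketch on a small hypothetical `toy` data frame standing in for `movies_extended`; only the column name `Major Genre` comes from the data above.

```python
import pandas as pd

# Hypothetical toy data standing in for movies_extended
toy = pd.DataFrame({
    'Major Genre': ['Drama', 'Comedy', 'Drama', 'Action', 'Drama', 'Comedy'],
})

# Number of rows per genre: the quantity that count() places on the x-axis
genre_counts = toy['Major Genre'].value_counts()
print(genre_counts)
```

Note that `value_counts()` lists the largest count first by default, whereas `sort='x'` in the chart orders the bars by ascending count; `sort='-x'` would reverse that.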

Stacked bar charts can visualize counts for combinations of two categorical columns

alt.Chart(movies_extended).mark_bar().encode(
    alt.X('count()'),
    alt.Y('Major Genre', sort='x'),
    alt.Color('MPAA Rating'))
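The segment lengths in a stacked bar chart correspond to a two-way table of counts. As a rough sketch, `pd.crosstab` produces that table; the `toy` data frame below is hypothetical and only mimics two columns of `movies_extended`.

```python
import pandas as pd

# Hypothetical toy data with the two categorical columns used in the chart
toy = pd.DataFrame({
    'Major Genre': ['Drama', 'Drama', 'Drama', 'Comedy', 'Comedy', 'Action'],
    'MPAA Rating': ['R', 'R', 'PG-13', 'PG', 'PG-13', 'R'],
})

# Rows are genres (one bar each), columns are ratings (one coloured segment each)
counts = pd.crosstab(toy['Major Genre'], toy['MPAA Rating'])
print(counts)
```

Each row of `counts` is one bar, and the row total is the full bar length.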

Reordering the bar segments aligns them with the order in the legend

alt.Chart(movies_extended).mark_bar().encode(
    alt.X('count()'),
    alt.Y('Major Genre', sort='x'),
    alt.Color('MPAA Rating'),
    alt.Order('MPAA Rating'))

Rescaling the bar lengths facilitates comparing proportions between bars

alt.Chart(movies_extended).mark_bar().encode(
    alt.X('count()', stack='normalize', title='Proportion of movies'),
    alt.Y('Major Genre', sort='x'),
    alt.Color('MPAA Rating'),
    alt.Order('MPAA Rating'))
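What `stack='normalize'` displays can be approximated in pandas by dividing each genre's counts by that genre's total. The sketch below uses a hypothetical `toy` data frame; `normalize='index'` asks `pd.crosstab` for row-wise proportions.

```python
import pandas as pd

# Hypothetical toy data standing in for movies_extended
toy = pd.DataFrame({
    'Major Genre': ['Drama', 'Drama', 'Drama', 'Comedy', 'Comedy', 'Action'],
    'MPAA Rating': ['R', 'R', 'PG-13', 'PG', 'PG-13', 'R'],
})

# Share of each rating within a genre; every row sums to 1
props = pd.crosstab(toy['Major Genre'], toy['MPAA Rating'], normalize='index')
print(props.round(2))
```

Because every row sums to 1, all bars have the same length and only the split between ratings differs, which is why the absolute counts are no longer visible in the normalized chart.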

Sorting by the length of one of the coloured segments makes the chart easier to read

sort_order = ['Adventure', 'Musical', 'Comedy', 'Romantic Comedy', 'Action',
              'Drama', 'Concert/Performance', 'Documentary', 'Western',
              'Thriller/Suspense', 'Horror', 'Black Comedy'] 
alt.Chart(movies_extended).mark_bar().encode(
    alt.X('count()', stack='normalize', title='Proportion of movies'),
    alt.Y('Major Genre', sort=sort_order),
    alt.Color('MPAA Rating'),
    alt.Order('MPAA Rating'))
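Instead of typing the list by hand, a sort order like this could be derived from the data. The sketch below is one possible approach, not necessarily how the list above was produced: it assumes the bars should be ordered by the share of a single chosen rating (`'R'` here, picked only for illustration) and runs on a hypothetical `toy` data frame.

```python
import pandas as pd

# Hypothetical toy data standing in for movies_extended
toy = pd.DataFrame({
    'Major Genre': ['Drama', 'Drama', 'Drama', 'Comedy', 'Comedy', 'Action'],
    'MPAA Rating': ['R', 'R', 'PG-13', 'PG', 'PG-13', 'R'],
})

# Proportion of each rating within a genre
props = pd.crosstab(toy['Major Genre'], toy['MPAA Rating'], normalize='index')

# Assumption: order genres from the largest to the smallest share of 'R'
sort_order = props['R'].sort_values(ascending=False).index.tolist()
print(sort_order)
```

The resulting list can be passed to `alt.Y('Major Genre', sort=sort_order)` in the same way as the hand-written one.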

Normalized stacked bar charts are most effective when there are just a few categories

sort_order = ['Concert/Performance', 'Musical', 'Documentary', 'Adventure',
              'Comedy', 'Romantic Comedy', 'Drama', 'Action']
# Keep only the movies rated G or PG so each bar has just two segments
g_and_pg_movies = movies_extended[movies_extended['MPAA Rating'].isin(['G', 'PG'])]
alt.Chart(g_and_pg_movies).mark_bar().encode(
    alt.X('count()', stack='normalize', title='Proportion of movies'),
    alt.Y('Major Genre', sort=sort_order),
    alt.Color('MPAA Rating'),
    alt.Order('MPAA Rating'))

Showing bars side by side makes it easier to compare their exact lengths within a category

(alt.Chart(movies_extended).mark_bar().encode(
    alt.X('count()', title=''),
    alt.Y('MPAA Rating', title=''),
    alt.Color('MPAA Rating', legend=None))
 .properties(width=100, height=45)
 .facet('Major Genre', columns=4)
 .resolve_scale(x='independent'))

Switching the faceting and y column targets the plot towards a slightly different question

(alt.Chart(movies_extended).mark_bar().encode(
    alt.X('count()', title=''),
    alt.Y('Major Genre', title='', sort='x'),
    alt.Color('MPAA Rating', legend=None))
 .properties(width=100, height=150)
 .facet('MPAA Rating')
 .resolve_scale(x='independent'))

Heatmaps are effective for visualizing counts of two-dimensional categorical data

alt.Chart(movies_extended).mark_rect().encode(
    alt.Color('count()'),
    alt.X('MPAA Rating'),
    alt.Y('Major Genre', sort='color'))
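Each rectangle in the heatmap corresponds to one genre and rating pair, coloured by its row count. A rough pandas equivalent, shown on a hypothetical `toy` data frame, is a long-form table of pair counts:

```python
import pandas as pd

# Hypothetical toy data standing in for movies_extended
toy = pd.DataFrame({
    'Major Genre': ['Drama', 'Drama', 'Drama', 'Comedy', 'Comedy', 'Action'],
    'MPAA Rating': ['R', 'R', 'PG-13', 'PG', 'PG-13', 'R'],
})

# One row per observed (genre, rating) pair, with the number of movies in it
pair_counts = (toy.groupby(['Major Genre', 'MPAA Rating'])
                  .size()
                  .reset_index(name='count'))
print(pair_counts)
```

Pairs that never occur, such as Action and PG in this toy data, get no row here, which corresponds to an empty cell in the heatmap.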

Using both the colour and marker size to indicate the count creates a more effective visualization

alt.Chart(movies_extended).mark_circle().encode(
    alt.X('MPAA Rating'),
    alt.Y('Major Genre', sort='color'),
    alt.Color('count()'),
    alt.Size('count()'))

Let’s apply what we learned!